AITopics | data governance

Policy-Aware Generative AI for Safe, Auditable Data Access Governance

Mandalawi, Shames Al, Mohammed, Muzakkiruddin Ahmed, Maclean, Hendrika, Cakmak, Mert Can, Talburt, John R.

arXiv.org Artificial Intelligence, Oct-28-2025

Enterprises need access decisions that satisfy least privilege, comply with regulations, and remain auditable. We present a policy-aware controller that uses a large language model (LLM) to interpret natural-language requests against written policies and metadata, not raw data. The system, implemented with Google Gemini 2.0 Flash, executes a six-stage reasoning framework (context interpretation, user validation, data classification, business purpose test, compliance mapping, and risk synthesis) with early hard policy gates and deny by default. It returns APPROVE, DENY, or CONDITIONAL together with cited controls and a machine-readable rationale. We evaluate on fourteen canonical cases across seven scenario families using a privacy-preserving benchmark. Results show Exact Decision Match improving from 10/14 to 13/14 (92.9%) after applying policy gates, DENY recall rising to 1.00, False Approval Rate on must-deny families dropping to 0, and Functional Appropriateness and Compliance Adherence at 14/14. Expert ratings of rationale quality are high, and median latency is under one minute. These findings indicate that policy-constrained LLM reasoning, combined with explicit gates and audit trails, can translate human-readable policies into safe, compliant, and traceable machine decisions.
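The gate-then-reason ordering described in the abstract above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Request` fields, the two example gates, and the `model_decision` argument standing in for the LLM's output are hypothetical and are not taken from the paper. The sketch only shows the control flow, in which hard gates run first and anything that is not a recognized decision falls back to DENY.

```python
from dataclasses import dataclass
from typing import List, Optional

DECISIONS = {"APPROVE", "DENY", "CONDITIONAL"}


@dataclass
class Request:
    # Hypothetical metadata fields; the paper's actual schema is not given here.
    user_is_verified: bool
    data_classification: str  # e.g. "public", "internal", "restricted"
    has_business_purpose: bool


def hard_gates(req: Request) -> Optional[str]:
    """Return "DENY" if an (illustrative) hard policy gate fires, else None."""
    if not req.user_is_verified:
        return "DENY"
    if req.data_classification == "restricted" and not req.has_business_purpose:
        return "DENY"
    return None


def decide(req: Request, model_decision: Optional[str]) -> str:
    """Apply gates first, then accept the model's decision only if it is valid."""
    gated = hard_gates(req)
    if gated is not None:
        return gated
    if model_decision in DECISIONS:
        return model_decision
    return "DENY"  # deny by default


def exact_decision_match(predicted: List[str], expected: List[str]) -> float:
    """Fraction of cases where the decision equals the expected label."""
    hits = sum(p == e for p, e in zip(predicted, expected))
    return hits / len(expected)
```

With 13 of 14 decisions matching, `exact_decision_match` returns 13/14, which rounds to the 92.9% reported in the abstract.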

governance, large language model, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2510.23474

Country: North America > United States (0.46)

Genre:

Research Report > New Finding (0.49)
Research Report > Experimental Study (0.47)

Industry: Information Technology > Security & Privacy (0.90)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.52)


Congratulations to the #AIES2025 best paper award winners!

AIHub, Oct-21-2025, 11:54:04 GMT

The eighth AAAI / ACM Conference on Artificial Intelligence, Ethics, and Society (AIES) is currently taking place in Madrid, Spain, running from 20-22 October. During the opening ceremony, the best papers for this year were announced. While it is well-known that AI systems might bring about unfair social impacts by influencing social schemas, much attention has been paid to instances where the content presented by AI systems explicitly demeans marginalized groups or reinforces problematic stereotypes. This paper urges critical scrutiny to be paid to instances that shape social schemas through subtler manners. Drawing from recent philosophical discussions on the politics of artifacts, we argue that many existing AI systems should be identified as what Liao and Huebner called oppressive things when they function to manifest oppressive normality.

ai system, aies2025 best paper award winner, congratulation, (11 more...)

AIHub

Country:

Europe > Spain > Community of Madrid > Madrid (0.26)
Asia > China (0.06)
North America > United States (0.05)

Genre: Personal > Honors > Award (0.41)

Technology:

Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (0.52)
Information Technology > Communications > Social Media (0.50)
Information Technology > Artificial Intelligence > Natural Language (0.49)


Copycats: the many lives of a publicly available medical imaging dataset

Jiménez-Sánchez, Amelia

Neural Information Processing Systems, Oct-10-2025, 16:54:07 GMT

Medical Imaging (MI) datasets are fundamental to artificial intelligence in healthcare.

dataset, documentation, mi dataset, (15 more...)

Neural Information Processing Systems

Country:

North America > United States > California > Santa Clara County > Palo Alto (0.04)
North America > Canada > Ontario > Toronto (0.04)
Europe > Denmark > Capital Region > Copenhagen (0.04)
(9 more...)

Genre: Research Report > Experimental Study (0.67)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Therapeutic Area > Neurology (1.00)
Health & Medicine > Nuclear Medicine (1.00)
(2 more...)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(3 more...)


Copycats: the many lives of a publicly available medical imaging dataset

Jiménez-Sánchez, Amelia, Avlona, Natalia-Rozalia, Juodelyte, Dovile, Sourget, Théo, Vang-Larsen, Caroline, Rogers, Anna, Zając, Hubert Dariusz, Cheplygina, Veronika

arXiv.org Artificial Intelligence, Sep-26-2025

Medical Imaging (MI) datasets are fundamental to artificial intelligence in healthcare. The accuracy, robustness, and fairness of diagnostic algorithms depend on the data (and its quality) used to train and evaluate the models. MI datasets used to be proprietary, but have become increasingly available to the public, including on community-contributed platforms (CCPs) like Kaggle or HuggingFace. While open data is important to enhance the redistribution of data's public value, we find that the current CCP governance model fails to uphold the quality needed and recommended practices for sharing, documenting, and evaluating datasets. In this paper, we conduct an analysis of publicly available machine learning datasets on CCPs, discussing datasets' context, and identifying limitations and gaps in the current CCP landscape. We highlight differences between MI and computer vision datasets, particularly in the potentially harmful downstream effects from poor adoption of recommended dataset management practices. We compare the analyzed datasets across several dimensions, including data sharing, data documentation, and maintenance. We find vague licenses, lack of persistent identifiers and storage, duplicates, and missing metadata, with differences between the platforms. Our research contributes to efforts in responsible data curation and AI algorithms for healthcare.

artificial intelligence, deep learning, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2402.06353

Country:

North America > United States (1.00)
Europe (0.67)

Genre: Research Report > Experimental Study (0.46)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Therapeutic Area > Neurology (1.00)
Health & Medicine > Nuclear Medicine (1.00)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (1.00)
Information Technology > Artificial Intelligence > Applied AI (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)


AI-Driven Generation of Data Contracts in Modern Data Engineering Systems

Bhoite, Harshraj

arXiv.org Artificial Intelligence, Jul-30-2025

Data contracts formalize agreements between data producers and consumers regarding schema, semantics, and quality expectations. As data pipelines grow in complexity, manual authoring and maintenance of contracts becomes error-prone and labor-intensive. We present an AI-driven framework for automatic data contract generation using large language models (LLMs). Our system leverages parameter-efficient fine-tuning methods, including LoRA and PEFT, to adapt LLMs to structured data domains. The models take sample data or schema descriptions and output validated contract definitions in formats such as JSON Schema and Avro. We integrate this framework into modern data platforms (e.g., Databricks, Snowflake) to automate contract enforcement at scale. Experimental results on synthetic and real-world datasets demonstrate that the fine-tuned LLMs achieve high accuracy in generating valid contracts and reduce manual workload by over 70%. We also discuss key challenges such as hallucination, version control, and the need for continuous learning. This work demonstrates that generative AI can enable scalable, agile data governance by bridging the gap between intent and implementation in enterprise data management.
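To make the idea of a data contract concrete, the following Python sketch derives a JSON Schema-style contract from sample records and checks new records against it. It is not the paper's method: simple type inference stands in for the fine-tuned LLM, and the function names `infer_contract` and `violations` are hypothetical. It assumes flat records with scalar values only, and the first non-null value seen for a field decides its type.

```python
# Map Python scalar types to JSON Schema type names.
_JSON_TYPES = {bool: "boolean", int: "integer", float: "number", str: "string"}
_PY_TYPES = {name: py_type for py_type, name in _JSON_TYPES.items()}


def infer_contract(samples):
    """Build a JSON Schema-style contract from a list of flat sample records."""
    properties = {}
    for record in samples:
        for key, value in record.items():
            if value is None:
                continue  # nulls only affect whether the field is required
            # Raises KeyError for unsupported (non-scalar) value types.
            properties.setdefault(key, {"type": _JSON_TYPES[type(value)]})
    # A field is required only if every sample has a non-null value for it.
    required = sorted(
        key for key in properties
        if all(record.get(key) is not None for record in samples)
    )
    return {"type": "object", "properties": properties, "required": required}


def violations(record, contract):
    """Return a list of human-readable contract violations for one record."""
    problems = []
    for key in contract["required"]:
        if record.get(key) is None:
            problems.append(f"missing required field: {key}")
    for key, spec in contract["properties"].items():
        value = record.get(key)
        if value is None:
            continue
        expected = _PY_TYPES[spec["type"]]
        # JSON Schema's "number" also accepts integers.
        ok = type(value) is expected or (expected is float and type(value) is int)
        if not ok:
            problems.append(f"wrong type for {key}: expected {spec['type']}")
    return problems
```

A producer could run `infer_contract` once on representative samples and a consumer could call `violations` on each incoming record; an empty list means the record satisfies the contract.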

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2507.21056

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.36)


Squeeze Out Tokens from Sample for Finer-Grained Data Governance

Lin, Weixiong, Ju, Chen, Wang, Haicheng, Hu, Shengchao, Xiao, Shuai, Chen, Mengting, Jiao, Yuheng, Yao, Mingshuai, Lan, Jinsong, Liu, Qingwen, Chen, Ying

arXiv.org Artificial Intelligence, Mar-18-2025

Widely observed data scaling laws, in which error falls off as a power of the training size, demonstrate the diminishing returns of unselective data expansion. Hence, data governance is proposed to downsize datasets through pruning non-informative samples. Yet, isolating the impact of a specific sample on overall model performance is challenging, due to the vast computation required to try out all sample combinations. Current data governors circumvent this complexity by estimating sample contributions through heuristic-derived scalar scores, thereby discarding low-value ones. Despite thorough sample sieving, retained samples contain substantial undesired tokens intrinsically, underscoring the potential for further compression and purification. In this work, we upgrade data governance from a 'sieving' approach to a 'juicing' one. Instead of scanning for least-flawed samples, our dual-branch DataJuicer applies finer-grained intra-sample governance. It squeezes out informative tokens and boosts image-text alignments. Specifically, the vision branch retains salient image patches and extracts relevant object classes, while the text branch incorporates these classes to enhance captions. Consequently, DataJuicer yields more refined datasets through finer-grained governance. Extensive experiments across datasets demonstrate that DataJuicer significantly outperforms existing DataSieve in image-text retrieval, classification, and dense visual reasoning.
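The diminishing-returns claim at the start of this abstract can be written out as a short worked relation. The notation is assumed here for illustration and is not taken from the paper: ε is the error, N the training-set size, and a > 0 and α > 0 are constants.

```latex
\epsilon(N) = a\,N^{-\alpha}, \qquad
\frac{d\epsilon}{dN} = -\alpha\,a\,N^{-\alpha-1}, \qquad
\frac{\epsilon(2N)}{\epsilon(N)} = 2^{-\alpha}.
```

The marginal gain per added sample shrinks as N grows, and each doubling of the data lowers the error by only the constant factor 2^{-α}, so equal relative improvements require exponentially more data.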

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2503.14559

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
Europe > Poland (0.04)
Asia > China > Shanghai > Shanghai (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)
(2 more...)


Scalable Multi-Agent Reinforcement Learning for Residential Load Scheduling under Data Governance

Qin, Zhaoming, Dong, Nanqing, Liu, Di, Wang, Zhefan, Cao, Junwei

arXiv.org Artificial Intelligence, Mar-4-2025

As a data-driven approach, multi-agent reinforcement learning (MARL) has made remarkable advances in solving cooperative residential load scheduling problems. However, centralized training, the most common paradigm for MARL, limits large-scale deployment in communication-constrained cloud-edge environments. As a remedy, distributed training shows unparalleled advantages in real-world applications but still faces challenges with system scalability, e.g., the high communication overhead of coordinating individual agents, and needs to comply with data governance in terms of privacy. In this work, we propose a novel MARL solution to address these two practical issues. Our proposed approach is based on actor-critic methods, where the global critic is a learned function of individual critics computed solely based on local observations of households. This scheme preserves household privacy completely and significantly reduces communication cost. Simulation experiments demonstrate that the proposed framework achieves comparable performance to the state-of-the-art actor-critic framework without data governance and communication constraints.

agent, household, value function, (14 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/TICPS.2024.3501278

2110.02784

Country:

North America > United States (0.14)
Asia > China > Beijing > Beijing (0.04)
Asia > China > Shanghai > Shanghai (0.04)
(2 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Information Technology > Security & Privacy (1.00)
Energy > Power Industry (1.00)
Transportation > Ground > Road (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)


Impact and influence of modern AI in metadata management

Yang, Wenli, Fu, Rui, Amin, Muhammad Bilal, Kang, Byeong

arXiv.org Artificial Intelligence, Jan-27-2025

Metadata management plays a critical role in data governance, resource discovery, and decision-making in the data-driven era. While traditional metadata approaches have primarily focused on organization, classification, and resource reuse, the integration of modern artificial intelligence (AI) technologies has significantly transformed these processes. This paper investigates both traditional and AI-driven metadata approaches by examining open-source solutions, commercial tools, and research initiatives. A comparative analysis of traditional and AI-driven metadata management methods is provided, highlighting existing challenges and their impact on next-generation datasets. The paper also presents an innovative AI-assisted metadata management framework designed to address these challenges. This framework leverages more advanced modern AI technologies to automate metadata generation, enhance governance, and improve the accessibility and usability of modern datasets. Finally, the paper outlines future directions for research and development, proposing opportunities to further advance metadata management in the context of AI-driven innovation and complex datasets.

data mining, information retrieval, machine learning, (24 more...)

arXiv.org Artificial Intelligence

2501.16605

Country:

Oceania > Australia > Tasmania (0.04)
Europe > United Kingdom (0.04)
Europe > Germany > Saxony > Leipzig (0.04)
(7 more...)

Genre:

Research Report (1.00)
Overview > Innovation (0.34)

Industry:

Law (1.00)
Information Technology > Services (1.00)
Information Technology > Security & Privacy (1.00)
Health & Medicine (1.00)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Information Management > Metadata Management (1.00)
Information Technology > Data Science > Data Quality (1.00)
(5 more...)


Data Stewardship Decoded: Mapping Its Diverse Manifestations and Emerging Relevance at a time of AI

Verhulst, Stefaan

arXiv.org Artificial Intelligence, Jan-20-2025

Data stewardship has become a critical component of modern data governance, especially with the growing use of artificial intelligence (AI). Despite its increasing importance, the concept of data stewardship remains ambiguous and varies in its application. This paper explores four distinct manifestations of data stewardship to clarify its emerging position in the data governance landscape. These manifestations include a) data stewardship as a set of competencies and skills, b) a function or role within organizations, c) an intermediary organization facilitating collaborations, and d) a set of guiding principles. The paper subsequently outlines the core competencies required for effective data stewardship, explains the distinction between data stewards and Chief Data Officers (CDOs), and details the intermediary role of stewards in bridging gaps between data holders and external stakeholders. It also explores key principles aligned with the FAIR framework (Findable, Accessible, Interoperable, Reusable) and introduces the emerging principle of AI readiness to ensure data meets the ethical and technical requirements of AI systems. The paper emphasizes the importance of data stewardship in enhancing data collaboration, fostering public value, and managing data reuse responsibly, particularly in the era of AI. It concludes by identifying challenges and opportunities for advancing data stewardship, including the need for standardized definitions, capacity building efforts, and the creation of a professional association for data stewardship.

collaboration, data stewardship, governance, (9 more...)

arXiv.org Artificial Intelligence

2502.10399

Genre: Research Report (0.40)

Industry:

Information Technology > Security & Privacy (0.94)
Law (0.69)

Technology: Information Technology > Artificial Intelligence (1.00)


Towards Data Governance of Frontier AI Models

Hausenloy, Jason, McClements, Duncan, Thakur, Madhavendra

arXiv.org Artificial Intelligence, Dec-4-2024

Data is essential to train and fine-tune today's frontier artificial intelligence (AI) models and to develop future ones. To date, academic, legal, and regulatory work has primarily addressed how data can directly harm consumers and creators, such as through privacy breaches, copyright infringements, and bias and discrimination. Our work, instead, focuses on the comparatively neglected question of how data can enable new governance capacities for frontier AI models. This approach for "frontier data governance" opens up new avenues for monitoring and mitigating risks from advanced AI models, particularly as they scale and acquire specific dangerous capabilities. Still, frontier data governance faces challenges that stem from the fundamental properties of data itself: data is non-rival, often non-excludable, easily replicable, and increasingly synthesizable. Despite these inherent difficulties, we propose a set of policy mechanisms targeting key actors along the data supply chain, including data producers, aggregators, model developers, and data vendors. We provide a brief overview of 15 governance mechanisms, of which we centrally introduce five underexplored policy recommendations. These include developing canary tokens to detect unauthorized use for producers; (automated) data filtering to remove malicious content for pre-training and post-training datasets; mandatory dataset reporting requirements for developers and vendors; improved security for datasets and data generation algorithms; and know-your-customer requirements for vendors. By considering data not just as a source of potential harm, but as a critical governance lever, this work aims to equip policymakers with a new tool for the governance and regulation of frontier AI models.
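As a rough illustration of the canary-token mechanism mentioned in the abstract above, the Python sketch below generates a unique marker, embeds it in a producer's documents, and later checks model output for it. The function names, the marker format, and the exact-substring check are all assumptions made for illustration; the paper does not specify an implementation, and a practical scheme may need to be more robust than a literal string match.

```python
import secrets


def make_canary(prefix="CANARY"):
    """Create a unique, hard-to-guess marker string to embed in a dataset."""
    return f"{prefix}-{secrets.token_hex(16)}"


def embed_canary(documents, canary, every=100):
    """Return a copy of `documents` with the canary appended to every `every`-th one."""
    return [
        f"{doc}\n{canary}" if i % every == 0 else doc
        for i, doc in enumerate(documents)
    ]


def canary_leaked(model_output, canary):
    """True if the exact canary string appears in text produced by a model."""
    return canary in model_output
```

Because the marker is random and appears nowhere else, finding it in a model's output suggests that the producer's data was included in training.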

arxiv, dataset, mechanism, (14 more...)

arXiv.org Artificial Intelligence

2412.03824

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
North America > United States > New York > Monroe County > Rochester (0.04)
North America > United States > Massachusetts (0.04)
(5 more...)

Genre:

Research Report (1.00)
Overview (0.88)

Industry:

Law > Statutes (1.00)
Information Technology > Security & Privacy (1.00)
Law > Intellectual Property & Technology Law (0.68)
(4 more...)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(3 more...)
